Analyzing the phonological sequences of each lect

First, I set up the environment of R Markdown.

knitr::opts_chunk$set(echo = TRUE, max.print = 100)

Then I load the required R packages. For this R script, I mostly used the grammar of Tidytable (Fairbanks 2022), which allows me to use Data.table (Dowle and Srinivasan 2022) in the syntax of Tidyverse (Wickham et al. 2019). For data visualization, I used ggplot2 (Wickham 2016).

library(readxl)
library(data.table)
library(tidytable)
library(stringr)
library(stringi)
library(geosphere)
library(plotly)
library(geodata)
library(plyr)
library(tibble)
library(forcats)
library(purrr)
library(Rfast)

For measuring the phonological distances, I have limited the sample lects to those with “full” phonotactic information coded in the database, excluding those whose some slots are missing or whose certain phonological sequences are represented by underspecified segments (such as PL representing plosive plus liquid). Thus, the sample lects investigated here are a substantially smaller subset of the whole sample of Phonotacticon 1.0. Here, I load the Eurasian lects with “full” information.

I excluded lects that have sequences including more than two consecutive “C” symbols (representing “consonant”), such as CC or CCC. This is because such transcriptions would lead to too many possible sequences. For example, one of the English coda sequences is /CCCs/. Obviously, not every logically possible triconsonantal clusters plus /s/ exists in English: It would be absurd to claim that /ssss/ is a possible coda in English. Thus, lects whose sequences include consecutive C’s were excluded.

Likewise, lects that have bracketed segments with too many members were excluded. For example, one of the possible onsets of Czech is \[pbfvmtdszncɟʃʒɲkɡxhlrr̝j\]\[pbtdɡfvszʃʒxjrr̝lmnɲ\]. This means a biconsontal sequence whose first member is any one of the 23 segments within the first brackets followed by any one of the 19 segments within the second brackets. As this would generate 437 logically possible sequences, including many onset sequences that do not exist in Czech (such as /pb/, /bt/, or /fm/), all sequences involving ten or more segments within brackets followed by ten or more segments within brackets were excluded.

Eurasia <- read_xlsx("Phonotacticon.xlsx") %>% 
  as.data.table() %>% 
  filter(Macroarea == 'Eurasia',
  Complete, 
  complete.cases(P), 
         O != '?', 
         N != '?',
         C != '?',
         !grepl('C{2,}', O), 
         !grepl('C{2,}', C),
         !grepl('\\[.{10}.*?\\]\\[.{10}.*?\\]', O),
         !grepl('\\[.{10}.*?\\]\\[.{10}.*?\\]', C)) %>% 
  select(-Note, -Complete, -Macroarea)

Eurasia

I also load PhonoBib, a bibliography of the Eurasian lects containing the bibliographical information as well as the longitude and latitude and language family.

PhonoBib <- read_xlsx("PhonoBib.xlsx") %>% 
  as.data.table()

PhonoBib[1:5, 1:10]

I load a modified version of PanPhon Mortensen et al. (2016) and sort out duplicate rows.

PanPhon <- fread("PanPhonPhonotacticon1_0.csv") %>% 
  as.data.table() %>% 
  unique(by = 'ipa')

PanPhon

I arrange PanPhon segments in alphabetical order.

PanPhonOrder <- PanPhon$ipa[
                  order(-nchar(PanPhon$ipa), 
                      PanPhon$ipa)]

head(PanPhonOrder)
## [1] "h͡d̪͡ɮ̪ʲʷ" "h͡d̪͡ɮ̪ʷː" "h͡d̪͡ɮ̪ʷˠ" "h͡d̪͡ɮ̪ʷˤ" "h͡d̪͡z̪ʲʷ" "h͡d̪͡z̪ʷː"

I create a regex line of PanPhon in order to split the segments from sequences. A sequence is defined as any number of consecutive segments that are used as a form of onset, nucleus, or coda in any lect. A segment is defined as the components of a sequence. For example, /spl/ is an English onset sequence, consisting of three segments: /s p l/. /p/ is also an English onset sequence, consisting of only one segment: /p/.

PanPhonRegex <- paste0("(?:",
                       paste(PanPhonOrder, collapse="|"),
                      '|B|C|Č|F|G|Ł|L|N|O|P|R|S|T|V|W|X|Z',
                       ")")

str_trunc(PanPhonRegex, 100)
## [1] "(?:h͡d̪͡ɮ̪ʲʷ|h͡d̪͡ɮ̪ʷː|h͡d̪͡ɮ̪ʷˠ|h͡d̪͡ɮ̪ʷˤ|h͡d̪͡z̪ʲʷ|h͡d̪͡z̪ʷː|h͡d̪͡z̪ʷˠ|h͡d̪͡z̪ʷˤ|h͡t̪͡ɬ̪ʲʷ|h͡t̪..."

I also create PanPhon regex including brackets, in order to detect segments within brackets (e. g. [ptk] meaning “/p/, /t/, or /k/”.)

PanPhonRegexBrackets <- paste0('(?:',
                               '(?<=\\[).*?(?=\\])|',
                               paste(PanPhonOrder, collapse="|"),
                               '|B|C|Č|F|G|Ł|L|N|O|P|R|S|T|V|W|X|Z',
                               ')')

str_trunc(PanPhonRegexBrackets, 100)
## [1] "(?:(?<=\\[).*?(?=\\])|h͡d̪͡ɮ̪ʲʷ|h͡d̪͡ɮ̪ʷː|h͡d̪͡ɮ̪ʷˠ|h͡d̪͡ɮ̪ʷˤ|h͡d̪͡z̪ʲʷ|h͡d̪͡z̪ʷː|h͡d̪͡z̪ʷˠ|h͡d̪͡z̪..."

I define “classes”, i. e. underspecified segments transcribed in capitals (e. g. P for plosives).

Classes <- PanPhon %>% 
  mutate(B = cons == 1 & lab == 1,
        C = cons == 1,
        Č = cons == 1 & delrel == 1 & son == -1,
        `F` = cons == 1 & cont == 1 & son == -1,
        G = grepl('j|w|ɥ|ɰ', ipa),
        Ł = cons == 1 & cor == 1 & lat == 1,
        L = cons == 1 & cont == 1 & cor == 1 & son == 1,
        N = nas == 1 & syl == -1,
        O = cont == 1 & son == 1 & syl == -1 & !grepl('h|ɦ', ipa),
        P = cons == 1 & cont == -1 & delrel == -1 & son == -1,
        R = cons == 1 & cont == 1 & cor == 1 & lat == -1 & son == 1,
        S = cons == 1 & cont == 1 & cor == 1 & son == -1,
        `T` = cons == 1 & son != 1,
        V = cons == -1 & cont == 1 & son == 1 & syl == 1,
        W = syl == -1 & voi == 1,
        X = syl == -1 & voi == -1,
        Z = cont == 1 & syl == -1) %>% 
  select(ipa, B, C, Č, `F`, G, Ł, L, N, O, P, R, S, `T`, V, W, X, Z) %>%
  melt(id.vars = 'ipa',
     variable.name = 'Class',
     value.name = 'Value') %>% 
  filter(Value) %>% 
  select(-Value)

Classes

I extract phonemes from the phonemic inventories.

Phonemes <- stri_extract_all_regex(Eurasia$P, 
                         pattern = PanPhonRegex,
                         simplify = TRUE) %>% 
  as.data.table() %>% 
  mutate(Lect = Eurasia$Lect) %>% 
  melt(id.vars = 'Lect',
       variable.name = 'Number',
       value.name = 'ipa') %>% 
  select(-Number) %>% 
  filter(ipa != '')

Phonemes

I subset lect, onsets, nuclei, and codas from Phonotacticon.

LectONC <- Eurasia %>% 
  select(Lect, O, N, C) %>% 
  melt(id.vars = 'Lect',
       variable.name = 'Category',
       value.name = 'Sequence')

LectONC

I extract the sequences from onset, nucleus, and coda categories.

Sequences <- LectONC[, tstrsplit(Sequence, ' ', fixed = FALSE)] %>% 
  mutate(Lect = LectONC$Lect,
         Category = LectONC$Category) %>% 
  melt(id.vars = c('Lect', 'Category'),
       variable.name = 'Number',
       value.name = 'Sequence') %>% 
  select(-Number) %>% 
  filter(!is.na(Sequence)) %>% 
  distinct()

Sequences

I subset sequences that include underspecified segments (transcribed in capital letters).

Capitals <- 
  Sequences %>% 
  filter(grepl('B|C|Č|F|G|Ł|L|N|O|P|R|S|T|V|W|X|Z', Sequence)) %>% 
  select(-Category) %>% 
  distinct()

Capitals

I convert the capital letters into the corresponding phonemes in each lect. For example, P (“plosive”) in Italian is converted to all the plosive phonemes in Italian phonemic inventory.

Decapitalized <- 
  stri_extract_all_regex(Capitals$Sequence,
                         pattern = PanPhonRegex,
                         simplify = TRUE) %>% 
  as.data.table() %>% 
  mutate(Lect = Capitals$Lect,
         Sequence = Capitals$Sequence) %>% 
  melt(id.vars = c('Lect', 'Sequence'),
        variable.name = 'Order',
       value.name = 'Segment') %>%
  mutate(Order = Order %>% 
           as.factor() %>% 
           as.integer()) %>% 
  filter(Segment != '') %>% 
  left_join(Classes, by = c('Segment' = 'Class')) %>% 
  mutate(ipa = if_else(is.na(ipa), Segment, ipa)) %>% 
  select(-Segment) %>% 
  inner_join(Phonemes) %>% 
  setorder(col = Order) %>% 
  split(by = c('Lect', 'Sequence')) %>% 
  lapply(function(x)
           split(x, by = 'Order')) %>% 
  lapply(function(x)
    lapply(x, function(x)
      x <- x$ipa)) %>% 
  lapply(function(x)
    expand.grid(x) %>%
      do.call(what = paste0)) %>% 
  enframe() %>% 
  unnest() %>% 
  as.data.table() %>% 
  separate(col = name,
           into = c('Lect', 'Sequence'),
           sep = '\\.') %>% 
  setnames('value', 'NewSequence') %>% 
  full_join(Sequences) %>% 
  mutate(Sequence =
           if_else(!is.na(NewSequence),
                   NewSequence,
                   Sequence)) %>% 
  select(-NewSequence)

Decapitalized

I split the sequences into segments, including bracketed segments (such as [ptk] for “/p/, /t/, or /k/.)

ToUnbracket <- stri_extract_all_regex(Decapitalized$Sequence, 
                         pattern = PanPhonRegexBrackets,
                         simplify = TRUE) %>% 
  as.data.table() %>% 
  mutate(Lect = Decapitalized$Lect,
         Category = Decapitalized$Category,
         Sequence = Decapitalized$Sequence) %>% 
  melt(id = c('Lect', 'Category', 'Sequence'),
       variable.name = 'Order',
       value.name = 'ipa') %>% 
  mutate(Order = Order %>% 
           as.factor() %>% 
           as.integer()) %>% 
  filter(ipa != "")

ToUnbracket

I subset bracketed sequences.

Bracketed <- ToUnbracket %>%
  filter(grepl('\\[', Sequence))

Bracketed

I convert the bracketed sequences into all logically possible sequences. For example, Laven’s sequence \[bdɟɡ\]\[rl\] is converted into /br/, /bl/, /dr/, /dl/, /ɟr/ /ɟl/, /ɡr/, and /ɡl/.

Unbracketed <- Bracketed$ipa %>% 
  stri_extract_all_regex(pattern = PanPhonRegex, simplify = TRUE) %>% 
  as.data.table() %>% 
  mutate(Sequence = Bracketed$Sequence,
         Order = Bracketed$Order) %>% 
  melt(id.vars = c('Sequence', 'Order'),
       variable.name = 'Number',
       value.name = 'ipa') %>%
  filter(ipa != '') %>% 
  select(-Number) %>% 
  setorder(col = Order) %>% 
  split(by = 'Sequence') %>% 
  lapply(function(x)
    split(x, by = 'Order')) %>% 
  lapply(function(x)
    lapply(x, function(x)
      x <- x$ipa)) %>% 
  lapply(function(x)
    expand.grid(x) %>%
      do.call(what = paste0)) %>% 
  enframe() %>% 
  unnest() %>% 
  setnames(c('name', 'value'),
           c('Sequence', 'NewSequence'))
  
Unbracketed

I join the unbracketed sequences into the whole list of sequences. Then I split the sequences into segments (e. g. /pl/ into /p/ and /l/).

Segments <- 
  stri_extract_all_regex(
  Unbracketed$NewSequence,
  pattern = PanPhonRegex,
  simplify = TRUE) %>% 
  as.data.table() %>% 
  mutate(Sequence = Unbracketed$Sequence,
         NewSequence = Unbracketed$NewSequence) %>% 
    pivot_longer(cols = -c(Sequence, NewSequence),
               names_to = 'Order',
               values_to = 'NewIPA') %>% 
    mutate(Order = Order %>% 
           as.factor() %>% 
           as.integer()) %>% 
  filter(NewIPA != '') %>% 
  full_join(ToUnbracket) %>% 
  mutate(Sequence = 
           if_else(
             !is.na(NewSequence),
             NewSequence,
             Sequence),
         ipa = 
           if_else(
             !is.na(NewIPA),
             NewIPA,
             ipa)) %>% 
  select(-NewSequence, -NewIPA)

Segments

Measuring the distance between sequences

In this section, I will show how I measure the distance between two sequences, e. g. between /pl/ and /spl/.

First, I measure the length of each sequence, in terms of the number of segments involved.

Sequences_length <- 
  Segments[, .(Length = max(Order)), .(Lect, Category, Sequence)]

Sequences_length

I join the length of each sequence to segments.

Segments <- left_join(Segments, Sequences_length)

Segments

I count the maximal length of all sequences, which is six.

MaxLength <- max(Sequences_length$Length)

MaxLength
## [1] 6

I count the number of all the split segments.

Segments_number <- nrow(Segments)

Segments_number
## [1] 44593

In order to measure the distance between two sequences of different length. I assign different “positions” to each sequence. As the maximal length of all sequences is six, a sequence of only one segment has six positions within these six slots. For example, the sequence /p/ would have the following six positions:

p_positions <- data.table(Slot1 = c('p', '', '', '', '', ''),
                          Slot2 = c('', 'p', '', '', '', ''),
                          Slot3 = c('', '', 'p', '', '', ''),
                          Slot4 = c('', '', '', 'p', '', ''),
                          Slot5 = c('', '', '', '', 'p', ''),
                          Slot6 = c('', '', '', '', '', 'p'))

p_positions

In the first row, where /p/ is assigned the leftmost position, the features of Slot 1 (lab1, voi1 …) are equivalent to the phonological features of /p/. In the remaining five slots (lab2, voi2, … lab6, voi6 …), the corresponding featural values are 0.

For the sequence /pl/, five positions are assigned within six slots:

pl_positions <- data.table(Slot1 = c('p', '', '', '', ''),
                          Slot2 = c('l', 'p', '', '', ''),
                          Slot3 = c('', 'l', 'p', '', ''),
                          Slot4 = c('', '', 'l', 'p', ''),
                          Slot5 = c('', '', '', 'l', 'p'),
                          Slot6 = c('', '', '', '', 'l'))

pl_positions

For the sequence /spl/, four positions are assigned within six slots:

spl_positions <- data.table(Slot1 = c('s', '', '', ''),
                          Slot2 = c('p', 's', '', ''),
                          Slot3 = c('l', 'p', 's', ''),
                          Slot4 = c('', 'l', 'p', 's'),
                          Slot5 = c('', '', 'l', 'p'),
                          Slot6 = c('', '', '', 'l'))

spl_positions

In order to compare the distance between /pl/ and /spl/, /pl/ in each of the five positions is mapped onto /spl/ in each of the four positions (5 * 4 = 20 comparisons). The table below shows the first five comparisons.

pl_vs_spl <- data.table(Comparison = rep(1:5, each = 2),
                        Slot1 = c('s', 'p', 's', '', 's', '', 's', '', 's', ''),
                        Slot2 = c('p', 'l', 'p', 'p', 'p', '', 'p', '', 'p', ''),
                        Slot3 = c('l', '', 'l', 'l', 'l', 'p', 'l', '', 'l', ''),
                        Slot4 = c('', '', '', '', '', 'l', '', 'p', '', ''),
                        Slot5 = c('', '', '', '', '', '', '', 'l', '', 'p'),
                        Slot6 = c('', '', '', '', '', '', '', '', '', 'l'))

pl_vs_spl                        

Among the five comparisons, the second comparison yields the minimal distance between /spl/ and /pl/, as /p/ is compared to /p/, /l/ is compared to /l/, and /s/ is compared to zero. (The comparison between segments in terms of phonological features will be discussed below). Thus, the second comparison is chosen as the distance between /pl/ and /spl/.1

I assign different positions to each sequence (from 0 to 5).

Sequences_rep <- bind_rows(rep(list(Segments), MaxLength))  %>% 
  mutate(Position = rep(0:(MaxLength - 1),
                        each = Segments_number)) %>% 
  mutate(Order = Order + Position) %>% 
  filter(Length + Position <= MaxLength) %>% 
  select(-Length)

Sequences_rep

I join segments with their phonological features (retrieved from PanPhon). Each feature is assigned the value of the position. For example, the features of /pl/ in the first position (position 0) has syl1, son1, cons1 … values for /p/ and syl2, son2, cons2 … values for /l/. The features of /pl/ in the second position (position 1) has syl2, son2, cons2 … values for /p/ and syl3, son2, cons3 … values for /l/, and so on.

Sequences_features <- Sequences_rep %>% 
  left_join(PanPhon, by = 'ipa') %>% 
               melt(id = c('Lect', 
                         'Category', 
                         'Sequence', 
                         'Order', 
                         'ipa', 
                         'Position'),
               variable.name = 'Feature',
               value.name = 'Value') %>% 
  mutate(Feature = paste0(Feature, Order)) %>% 
  dcast(Lect + Category + Sequence + Position ~ Feature,
        value.var = 'Value',
        fun.aggregate = sum,
        fill = 0) %>%
  mutate(SequencePosition = paste0(Sequence, Position)) %>% 
  select(-Lect, -Category, -Position, -Sequence) %>% 
  distinct()

Sequences_features[1:10, 1:10]

I then calculate the Euclidean distance between each pair of sequences.

Sequences_distance <- Sequences_features %>% 
  select(-SequencePosition) %>% 
  Dist() %>%
  as.data.table()

Sequences_distance[1:5, 1:5]

In order to name the rows and the columns of the distance matrix, I create a vector of all sequences in different positions.

SequenceVectors <- 
  str_replace(Sequences_features$SequencePosition, '[0-9]', '')

head(SequenceVectors)
## [1] "bl" "bl" "bl" "bl" "bl" "d"

I then use this vector to name the rows and the columns of the distance matrix.

setnames(Sequences_distance, SequenceVectors)

Sequences_distance[, Sequence.x := SequenceVectors]

Sequences_distance[1:5, 1:5]

As I have shown above, I need to choose the minimal distance between two sequences mapped onto each other in different positions. (For example, the minimal distance between /pl/ and /spl/ would arise when /s/ is mapped onto zero, /p/ is mapped onto /p/, and /l/ is mapped onto /l/.) Thus I calculate the minimal distance per each sequence pair.

Sequences_MinDistance <- 
  Sequences_distance %>% 
  melt(id.vars = 'Sequence.x',
     variable.name = 'Sequence.y',
     value.name = 'Distance') %>% 
  as.data.table() %>% 
  .[, .(Distance = min(Distance)),
     by = .(Sequence.x, Sequence.y)]

Sequences_MinDistance

As an example, below is shown the sequences that are the most similar to /pl/.

pl <- Sequences_MinDistance %>% 
  filter(Sequence.x == 'pl',
         Sequence.y != '∅') %>% 
  arrange(Distance)

pl

As another example, below is shown the sequences that are the most similar to /ia/.

ia <- Sequences_MinDistance %>% 
  filter(Sequence.x == 'ia',
         Sequence.y != '∅') %>% 
  arrange(Distance)

ia

Measuring the distance between lects

In this section, I will show how I measure the phonological distance between two lects.

First, I redefine sequences with new (decapitalized and unbracketed) sequences.

NewSequences <- Segments %>% 
  select(Lect, Category, Sequence) %>% 
  as.data.table()

NewSequences

I then calculate the distance between two lects within the same category (onset, nucleus, or coda).

The distance between the onset/nucleus/coda sequences of two lects is defined as follows. Let M1 and M2 be the matrices representing the phonological feature values of the onset/nucleus/coda sequences of lect 1 and lect 2, respectively. Let MD be the distance matrix between M1 and M2. The distance between the onset/nucleus/coda forms of lect 1 and 2 is the average value of the minimum values of rows or columns of MD, whichever is higher.

As an example, suppose that lect A allows three onset sequences, /p m t/, and lect B two onset sequences, /p t/. The comparison between A and B is shown in the table below. The last column and the last row shows the minimum value of each row and each column, respectively. The average value of the minimum column (the comparison from A to B) is ca. 1.1547, whereas the average value of the minimum row (the comparison from B to A) is 0. The biggest value of these two (1.1547) is selected. Thus, the onset distance between A and B is ca. 1.1547.

A_vs_B <- data.table(A_B = c('p', 'm', 't', 'min'),
                     p = c(0, 3.46, 3, 0),
                     t = c(3, 4.58, 0, 0),
                     min = c(0, 3.46, 0, NA))

A_vs_B

Using this formula, I calculate the distance of onset, nucleus, and coda of each pair of lects.

ONC_distance <- NewSequences %>% 
  left_join(NewSequences, by = 'Category') %>% 
  left_join(Sequences_MinDistance, 
            by = c('Sequence.x', 'Sequence.y')) %>% 
  .[, .(Distance = min(Distance)),
              by = .(Lect.x, Lect.y, Sequence.x, Category)] %>% 
  .[, .(Distance = mean(Distance)), 
              by = .(Lect.x, Lect.y, Category)] %>%
  mutate(Lect_vs_Lect = str_c(pmin(Lect.x, Lect.y), 
                      'vs.', 
                      pmax(Lect.x, Lect.y),
                      sep = ' ')) %>% 
  .[, .(Distance = max(Distance)), 
               by = .(Lect_vs_Lect, Category)] %>% 
   dcast(., Lect_vs_Lect ~ Category, value.var = 'Distance')
  
ONC_distance

Next, I calculate the distance between the tonality of each pair of lects.

The distance between tonality is defined as the Canberra distance between the numbers of tonemes of two lects. Let T1 be the number of tonemes lect 1 has, and T2 the number of tonemes lect 2 has. The distance betwen lect 1 and lect 2 is \(\frac{|T1 - T2|}{(T1 + T2)}\).

For example, Japanese has 1 toneme (high pitch), whereas Mandarin has 4. The tonal distance (T) between two lects is thus \(\frac{|1 - 4|}{(1 + 4)} = \frac{3}{5}\).

If both lects have 0 tonemes, then the distance between the two lects is 0.

First, I count the number of tones in each lect:

Tones <- Eurasia %>%
  select(Lect, `T`) %>%
  mutate(`T` = gsub("\\-", NA, `T`)) %>%
  mutate(`T` = str_count(`T`, " ") + 1)

Tones[is.na(Tones$`T`),]$`T` <- 0

Tones

I then calculate the Canberra distance between the numbers of tones of each pair of lects.

Tones_distance <- dist(Tones, method = 'canberra') %>% 
  as.matrix() %>%
  as.data.table() %>% 
  setnames(Tones$Lect) %>% 
  mutate(Lect = Tones$Lect) %>% 
  melt(id = 'Lect', 
       variable.name = 'Lect2', 
       value.name = 'T') %>%
  mutate(`T` = replace_na(`T`, 0)) %>% 
  mutate(Lect_vs_Lect = str_c(pmin(as.character(Lect), as.character(Lect2)),
                                'vs.',
                                pmax(as.character(Lect), as.character(Lect2)),
                               sep = ' ')) %>%
  select(Lect_vs_Lect, `T`) %>%
  distinct()
## Warning in dist(Tones, method = "canberra"): NAs introduced by coercion
Tones_distance

I join segmental distance with tonal distance.

ONCT_distance <- ONC_distance %>% 
  full_join(Tones_distance)

ONCT_distance

I then calculate the overall distance.

PhonoDist <- ONCT_distance %>% 
  select(-Lect_vs_Lect) %>% 
  scale() %>% 
  as.data.table() %>% 
  mutate(Lect_vs_Lect = ONCT_distance$Lect_vs_Lect) %>% 
  mutate(Distance = sqrt((O - min(O)) ^ 2 +
                         (N - min(N)) ^ 2 +
                         (C - min(C)) ^ 2 +
                         (`T` - min(`T`)) ^ 2)) %>% 
  select(Lect_vs_Lect, Distance)

PhonoDist

The table below shows the lects that are the closest to Yue Chinese (aka Cantonese). We see that most of the top ten closest lects are spoken in Mainland Southeast Asia or the Tibetan Plateau. Surprisingly, no Sinitic lect is among the top ten.

Yue <- PhonoDist %>% 
  filter(grepl('Yue Chinese', Lect_vs_Lect)) %>%
  arrange(Distance)

Yue

The table below shows the ten lects the most similar to Hindi. Unsurprisingly, most of the top ten lects are spoken in South Asia.

Hindi <- PhonoDist %>% 
  filter(grepl('Hindi', Lect_vs_Lect)) %>%
  arrange(Distance)

Hindi

The table below shows the ten lects the most similar to German. Again, unsurprisingly, most of the top ten lects are spoken in Europe.

German <- PhonoDist %>% 
  filter(grepl('German', Lect_vs_Lect)) %>%
  arrange(Distance)

German

The table below shows the ten lects the most similar to Halh Mongolian. Most of the closest lects are spoken in central Eurasia, belonging to diverse language families (Tungusic, Indo-European, Turkic, etc.)

Halh <- PhonoDist %>% 
  filter(grepl('Halh Mongolian', Lect_vs_Lect)) %>%
  arrange(Distance)

Halh

Clustering the lects

Based on these distances, I perform two analyses: Multidimensional scaling and k-means clustering, in order to cluster similar lects together and detect areal patterns.

First, I make a comprehensive list of a lect versus another lect.

Lect_vs_Lect <- str_split_fixed(
  PhonoDist$Lect_vs_Lect, ' vs. ', n = 2) %>% 
  as.data.table() %>% 
  setnames(c('Lect.x', 'Lect.y'))

Lect_vs_Lect

Then, I conduct multidimensional scaling based on the phonological distances, the number of dimensions being maximal, i. e. the number of sample lects minus one.

PhonoScale <- PhonoDist %>% 
  bind_cols(Lect_vs_Lect) %>% 
  select(-Lect_vs_Lect) %>% 
  dcast(Lect.x ~ Lect.y, value.var = 'Distance') %>% 
  column_to_rownames('Lect.x') %>% 
  t() %>% 
  as.dist() %>% 
  cmdscale(k = nrow(Eurasia) - 1) %>%
  as.data.frame() %>% 
  rownames_to_column('Lect') %>% 
  as.data.table()
## Warning in cmdscale(., k = nrow(Eurasia) - 1): only 123 of the first 208
## eigenvalues are > 0
PhonoScale[1:10, 1:10]

Next, based on the multidimensional scaling, I will perform k-means clustering of two, five, and ten clusters, in order to see if the clusters formed based on phonological distance consist of areally close lects.

I cluster the lects into two groups.

K2 <- PhonoScale %>% 
  select(-Lect) %>% 
  kmeans(2) %>% 
  pluck(1) %>% 
  as_factor()

K2
##   [1] 2 2 1 2 2 1 1 1 1 2 1 1 1 1 1 1 1 1 2 2 2 1 2 1 2 1 1 2 1 2 2 2 1 1 1 1 2
##  [38] 1 2 1 2 1 1 1 1 2 1 1 1 1 2 1 2 1 1 2 2 1 1 2 1 2 2 1 1 1 2 1 1 2 1 2 1 1
##  [75] 1 2 1 1 1 2 2 1 1 2 1 2 2 1 2 1 1 2 1 1 2 1 1 1 1 2 2 1 2 2 1 1 1 1 1 1 1
## [112] 1 2 1 2 1 2 2 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 2 1 2 1 2 2 1 1 1 1 2 1 2
## [149] 1 1 1 1 2 1 2 2 1 1 1 2 1 2 1 2 2 1 1 2 2 1 2 2 1 1 1 1 1 1 2 2 2 1 2 1 1
## [186] 1 1 1 1 1 2 1 1 1 1 2 2 2 2 2 1 1 2 2 2 2 2 1 2
## Levels: 1 2

I cluster the lects into five groups.

K5 <- PhonoScale %>% 
  select(-Lect) %>% 
  kmeans(5) %>% 
  pluck(1) %>% 
  as_factor()

K5
##   [1] 2 4 3 2 2 1 5 3 4 4 1 1 3 1 1 3 4 4 4 3 2 3 2 1 2 1 1 2 1 2 2 2 4 3 1 4 3
##  [38] 1 2 5 2 1 4 1 5 2 3 1 3 5 2 4 2 3 4 2 2 4 1 4 5 1 2 4 5 1 4 4 5 2 1 2 1 1
##  [75] 1 2 5 4 4 2 3 3 1 1 1 2 3 4 3 4 4 2 1 1 2 4 1 1 4 2 1 1 1 2 4 5 3 5 4 1 1
## [112] 4 2 3 3 1 2 2 4 4 4 1 4 4 3 1 2 2 1 5 3 1 3 3 2 2 1 3 5 2 2 1 5 3 4 2 3 4
## [149] 1 1 5 1 2 1 2 2 4 4 1 2 1 3 3 4 1 1 1 4 1 4 2 2 1 3 1 1 3 1 2 2 1 1 2 1 1
## [186] 1 1 1 1 1 3 3 5 1 3 2 3 2 2 2 3 3 2 2 1 3 2 3 2
## Levels: 1 2 3 4 5

I cluster the lects into ten groups.

K10 <- PhonoScale %>% 
  select(-Lect) %>% 
  kmeans(10) %>% 
  pluck(1) %>% 
  `-`(1) %>% 
  as_factor()

K10
##   [1] 4 9 5 1 4 3 7 5 8 9 3 3 5 3 3 5 8 8 8 6 4 6 4 3 4 3 8 0 3 4 6 4 2 5 3 8 6
##  [38] 3 4 7 0 3 8 3 7 1 6 3 5 7 0 8 1 5 8 1 1 8 3 9 2 6 4 8 7 3 9 8 7 4 3 6 3 3
##  [75] 3 4 7 2 8 0 6 5 3 3 3 4 0 8 6 8 8 1 3 3 1 8 3 8 8 6 3 5 6 6 2 5 5 7 9 3 3
## [112] 8 4 5 6 3 4 4 0 2 8 3 8 8 6 3 0 1 3 7 5 3 5 6 6 4 3 6 7 6 4 3 7 5 8 4 5 8
## [149] 3 3 7 3 4 3 6 1 9 2 3 1 3 1 5 9 6 3 3 8 6 8 6 4 3 5 3 3 5 3 4 6 6 3 4 8 3
## [186] 3 3 3 3 3 6 5 7 3 5 1 0 1 4 1 5 5 0 0 6 6 1 5 6
## Levels: 0 1 2 3 4 5 6 7 8 9

For visualization, I make a list of lects and their geographical coordinates.

Lect_LonLat <- PhonoBib %>% 
  filter(Lect %in% Eurasia$Lect) %>% 
  select(Lect, lon, lat)

Lect_LonLat

I add clusters and geographical coordinates to the table of phonological distances.

PhonoClusters <- PhonoScale %>% 
  mutate(K2 = K2,
         K5 = K5,
         K10 = K10) %>% 
  select(Lect, K2, K5, K10) %>% 
  left_join(Lect_LonLat)

PhonoClusters

In order to plot the lects onto a map, I first load map data.

map <- map_data("world")

head(world)
##                                                                  
## 1 function (resolution = 5, level = 0, path, version = "latest", 
## 2     ...)                                                       
## 3 {                                                              
## 4     stopifnot(level[1] == 0)                                   
## 5     resolution = round(resolution[1])                          
## 6     stopifnot(resolution %in% 1:5)

Then I create a map of Eurasia.

EurasiaMap <- ggplot(map, aes(x = long, y = lat)) + 
  geom_polygon(aes(group = group),
               fill = "white", 
               color = "darkgrey", 
               size = 0.2) +
  coord_map("ortho",
            orientation = c(20, 70, 0),
            xlim = c(10, 130),
            ylim = c(0, 90)) +
  theme_void()

EurasiaMap

I assign the two clusters on the map, each integer in different colors representing different clusters. We see that East and Southeast Asia are distinct from the rest of the macroarea.

PhonoK2 <- EurasiaMap +
  geom_text(aes(x = lon,
                 y = lat,
                 label = K2,
                 color = K2),
             data = PhonoClusters,
            show.legend = FALSE) +
  theme(legend.position = 'bottom')

PhonoK2

I assign the five clusters on the map.

PhonoK5 <- EurasiaMap +
  geom_text(aes(x = lon,
                 y = lat,
                 label = K5,
                 color = K5),
             data = PhonoClusters,
            show.legend = FALSE) +
  theme(legend.position = 'bottom')

PhonoK5

I assign the ten clusters on the map.

PhonoK10 <- EurasiaMap +
  geom_text(aes(x = lon,
                 y = lat,
                 label = K10,
                 color = K10),
             data = PhonoClusters,
            show.legend = FALSE) +
  theme(legend.position = 'bottom')

PhonoK10

From the visualized k-means clustering, we can make the following observation: Phonological clusters also tend to form geographical clusters. Lects in Europe, Mainland South East Asia, Northeast Asia, and South Asia tend to form clusters. We can confirm that at least at the phonological level, these areas indeed form separate linguistic areas.

For the html version of this Markdown only, I will create a 3d plot visualizing the first three dimensions of the multidimensional scaling. First, I subset the first three dimensions of the multidimensional scaling and join the K-means clustering.

ThreeD <- PhonoScale %>%
  select(Lect, V1, V2, V3) %>%
  left_join(PhonoClusters)

head(ThreeD)

I then draw a 3D plot of multidimensional scaling, where each color represents one of the five clusters.

ThreeDPlot <- plot_ly(
            data = ThreeD,
            x = ~V1,
            y = ~V2,
            z = ~V3,
            color = ~K5,
            text = ~Lect,
            textfont = list(size = 15),
            type = 'scatter3d',
            mode = 'text') %>%
  layout(showlegend = FALSE)

ThreeDPlot

Testing the correlation between phonological and geographical distances

Lastly, I will test the following hypothesis: Geographical distance correlates with phonological distance. That is, geographically closer lects also tend to be phonologically similar.

In order to test this hypothesis, I first create a tibble of lects, coordinates, families, and a dummy column (kilometers).

Lect_Kilometers <- Lect_LonLat %>% 
  left_join(PhonoBib) %>% 
  select(Lect, lon, lat, Family) %>% 
  mutate(Kilometers = 'Kilometers')
  
Lect_Kilometers

I then create a table of lect vs. lect, their geographical coordinates, and their genealogical relationship (whether they belong to the same family or not).

Coordinates <- Lect_Kilometers %>% 
  full_join(Lect_Kilometers, by = 'Kilometers') %>% 
  mutate(Lect_vs_Lect = str_c(pmin(Lect.x, Lect.y),
                             "vs.",
                              pmax(Lect.x, Lect.y),
                              sep = " "),
  Crossfamilial = Family.x != Family.y)
         
Coordinates

I subset Coordinates x.

Coordinates.x <- select(Coordinates, lon.x, lat.x)

Coordinates.x

I subset Coordinates y.

Coordinates.y <- select(Coordinates, lon.y, lat.y)

Coordinates.y

I then calculate the geographical distances between two columns of coordinates. I leave out pairs of lects that belong to the same family, as lects belonging to the same family tend to be phonologically similar and also geographically closer.

GeoDist <- Coordinates %>% 
  mutate(Kilometers = 
           distHaversine(Coordinates.x, Coordinates.y) / 1000) %>% 
  filter(Crossfamilial) %>% 
  select(Lect_vs_Lect, Kilometers) %>% 
  distinct()

GeoDist

I join the phonological distances to the geographical distances.

PhonoGeoDist <- left_join(GeoDist, PhonoDist, by = 'Lect_vs_Lect')

PhonoGeoDist

I conduct linear regression between geographical distance and phonological distances.

PhonoGeoDist %>% 
  lm(formula = Distance ~ Kilometers) %>% 
  summary()
## 
## Call:
## lm(formula = Distance ~ Kilometers, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9728 -0.6582 -0.0037  0.6696  4.0808 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.395e+00  1.433e-02  306.63   <2e-16 ***
## Kilometers  1.281e-04  3.128e-06   40.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.01 on 18174 degrees of freedom
## Multiple R-squared:  0.08448,    Adjusted R-squared:  0.08443 
## F-statistic:  1677 on 1 and 18174 DF,  p-value: < 2.2e-16
PhonoGeoDist

Finally, I visualize the linear regression.

PhonoLM <- 
  ggplot(aes(x = Kilometers, y = Distance), data = PhonoGeoDist) +
  geom_point(alpha = 0.1) +
  geom_smooth(formula = y ~ x, method = 'lm', color = 'red') +
  theme_classic() +
  scale_x_continuous(name = 'Geographical distance (km)') +
  scale_y_continuous(name = 'Phonological distance (z)')

PhonoLM

We see that there is a correlation between the geographical distance and the phonological distance between lects that are not genealogically related. We can thus make the following conclusion: Geographically close lects tend to phonologically converge.

References (within Markdown)

Dowle, Matt, and Arun Srinivasan. 2022. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.
Fairbanks, Mark. 2022. Tidytable: Tidy Interface to ’Data.table’. https://CRAN.R-project.org/package=tidytable.
Mortensen, David R, Patrick Littell, Akash Bharadwaj, Kartik Goyal, Chris Dyer, and Lori Levin. 2016. “Panphon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors.” In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 3475–84.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

  1. I thank Huisu Yun for suggesting this idea to me.↩︎